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Abstract We propose a Bayesian model for mixed ordinal and continuous multi- 
variate data to evaluate a latent spatial Gaussian process of wetland condition. Our 
proposed model can be used in many contexts where mixed continuous and discrete 
multivariate responses are observed in an effort to quantify an unobservable con- 
tinuous measurement. In our example, the latent, or unobservable measurement is 
wetland condition. While predicted values of the latent wetland condition variable 
produced by the model at each location do not hold any intrinsic value, the relative 
magnitudes of the wetland condition values are of interest. In addition, by including 
point-referenced covariates in the model, we are able to make predictions at new 
locations for both the latent random variable and the multivariate response. Lastly, 
the model produces ranks of the multivariate responses in relation to the unobserved 
latent random field. This is an important result as it allows us to determine which 
response variables are most closely correlated with the latent variable. Our approach 
offers an alternative to traditional indices based on best professional judgment that 
are frequently used in ecology. We apply our model to assess wetland condition in 
the North Platte and Rio Grande River Basins in Colorado. The model facilitates a 
comparison of wetland condition at multiple locations and ranks the importance of 
in-field measurements. 
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1 Introduction 



Latent variable modeling has become common practice in a variety of scientific re- 
search fields where the latent variables are not directly observed but instead inferred 
from other values that are observed. These models are particularly relevant when the 
observed data are assumed to be driven by some underlying, unobservable process. 
Often times in the biological and ecological sciences, for example, multiple mea- 
surements are reported for each sampling unit or at each sampled location within a 
spatial domain and the goal is to understand the underlying latent variable(s) gen- 
erating the measurements. Here, these measurements make up a multivariate re- 
sponse. In spatial statistics, a latent variable could be used to model a random field, 
or process. Chakra borty et al.| ( |2010| ) applied a latent spatial process model to model 
species abundance across a large region of South Africa. jChristensen and Amemiya 



(2002) developed a general framework for multivariate latent variable models that 
incorporates spatial correlation among the latent variables. 

We focus on ordered categorical, or ordinal data where measurements for each 
observation are reported on a specified scale, (e.g., low, medium, high). Some dis- 
crete data are ordinal in nature. For example, in survey data, respondents are asked 
to characterize their opinions on a Likert scale ranging from strongly disagree to 
strongly agree. In other situations, data will be ordinal when a researcher reports 
the response as a discretized continuous variable instead of as the actual continuous 
variable due to constraints on the data collection process. This may be the case when 
reporting sediment size in streams or surface area of leaves on individual plants, es- 
pecially when the data are to be collected over a large spatial domain. 

We propose a model for drawing inference about mixed ordinal and continu- 
ous multivariate response data. We refer to the model as a multilevel latent process 
model because we introduce latent variables at two levels within the hierarchy. The 
first level of latency is introduced by assuming there is a continuous latent process 
that generates each variable of the multivariate response. The model extends the 
multivariate latent health factor model proposed by Chiu et al. (2011 1 by allowing 
dependence on the site effect to vary across response variables. 

The second level of latency is introduced by assuming there exists an underlying 
univariate latent spatial process, or latent random field, that is generating the multi- 
variate response. We assume a linear relationship between each of the latent continu- 
ous response processes (first level of latency) and the latent spatial process (second 
level of latency). Refer to Figure [T] for a diagram of the multilevel latency. This 
model provides estimates of the latent spatial process in order to compare different 
locations within a specified region of interest. Second, the model allows quantifica- 
tion of the relationship between the spatial latent variable and each of the variables 
of the multivariate response. Lastly, we can determine which of the variables of the 
multivariate response are most closely associated with the latent spatial process. In 
doing so, we can establish weights for each of the response variables to be used in 
weighted averaging for estimating the underlying latent spatial process. By incorpo- 
rating point-referenced covariate information, we can predict the value for the latent 
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spatial variable as well as the mixed ordinal and continuous multivariate response at 
new locations. 

In Section|2]we motivate the model with an application of assessing the condition 
of wetlands in Colorado. In Section[3]we introduce the mixed ordinal and continuous 
multivariate latent Gaussian process model; we also describe methods of inference 
and estimation of the model parameters under the Bayesian framework. In Section 
[4]we develop methods to predict the latent random variable and ranking procedures 
for the multivariate response. The methodology is applied in Section [5] through the 
evaluation of wetland condition in the North Platte and Rio Grande River Basins 
of Colorado. Section [6] concludes with a brief discussion and recommendations for 
future work. 



2 Motivating example 

The proposed model was motivated by a program to asses the condition of wetlands 
in Colorado. Limited data exist on the location, type, and condition of Colorado's 
wetlands hindering wetland management. The long-term viability and integrity of 
Colorado's wetland resources are threatened due to increased demand from major 
urban areas for water development and storage projects, growth in the oil and gas 
industry, and changes in forest health (Dahl] [20TT) . The data considered here were 
collected through a partnership between Colorado Parks and Wildlife (CPW)'s Wet- 
lands Program and the Colorado Natural Heritage Program (CNHP) to assess the 
condition of wetlands in Colorado. The specific data used in this model were col- 



lected in Colorado's North Platte and Rio Grande River Basins (Lemly et al. 201 1 



Lemly and Gillian 2012|). One of the major goals of the CPW-CNHP partnership 



is to model the spatial distribution of wetland ecological condition throughout each 
river basin in the state. Our goal was to improve spatial modeling techniques in order 
to help land managers effectively maintain and improve critical wetland habitats. 

In order to implement effective wetland protection strategies and to establish 
restoration and management plans, wetlands must be assessed and then poten- 
tial threats or stressors identified. There are many different in-field measurements, 
known as metrics, that reflect various aspects of wetland condition. These metrics 
can be of any variable type including continuous, count data, ordinal, etc. Over- 
all scores that are computed based on multiple measurements are referred to as 
multi-metric indices. When the metrics are of the same variable type, one index to 
evaluate overall wetland condition is an average metric score. However, difficulty 
arrises when trying to compute an index that encompasses metrics of different vari- 
able types. In this work, we propose using continuous latent variables as consistent 
measures across all metric types. Appropriate link functions can map the continuous 
latent variables to the different metrics. 

One popular index that incorporates 12 metrics to evaluate ecological condition 
is the index of biotic integrity, or IBI ( |Karr|[T98T) . It is of great interest to ecologists 
to determine whether the particular metrics that are used in computing the IBI are 
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useful in evaluating wetland condition. Of equal importance, ecologists are inter- 
ested in identifying which of the measurements taken during in-field data collection 
are most representative of overall wetland condition. This is beneficial as it will not 
only increase accuracy in gauging wetland condition but will also save time and 
resources for future data collection by requiring fewer measurements. 

There are tens of thousands of acres of reported wetlands in Colorado's North 
Platte and Rio Grande River Basins and sampling time and resources are limited. 
One of the major goals of the wetland profiling project is to model the spatial dis- 
tribution of the ecological condition of wetlands throughout the basins and deter- 
mine the optimal metrics for measuring key habitat features for wetland-dependent 
wildlife species. We compare the ecological condition of the wetlands based on five 
metrics in both the North Platte and Rio Grande River Basins. 



3 Model and inference 

3.1 Multivariate mixed response data 

One of the main goals of this work is to use observed mixed ordinal and continuous 
multivariate responses from a finite number of point-referenced locations to draw 
inference on an underlying latent spatial process. We wish to make predictions of 
the latent spatial process as well as quantify uncertainty. The model consists of first 
representing each of the multivariate response variables as a continuous response. 
For the ordinal response variables, this continuous response is latent. We then define 
a linear relationship between each of the (latent) continuous response variables and 
the underlying latent spatial process of interest. We assume that each of the response 
variables contains information about this latent spatial process. Refer to Figure[T]for 
a diagram of the multilevel latent model. 

For the spatial domain of interest, D, define {Y(s) = \Y\ (s) , . . . , Yj(s)] ,s G D} as 
a mixed ordinal and continuous multivariate random field at location s having / re- 
sponses. Each response at location s, {fj(s),s for j — 1, ... ,7 is modeled by a 
random field of either continuous or ordinal values. Let J c denote the number of con- 
tinuous response variables and J denote the number of ordinal response variables, 
where J (t > 1 . Therefore, / = J +J c . For all ordinal variables variables j in 1 , . . . ,/ , 
the observable response T/(s) G {1, . . . ,K} for every location s. The model can eas- 
ily be generalized to include observable response variables with varying number of 
categories, e.g. Yj(s) G {1,. . . ,Kj}. In such a case, parameter constraints, discussed 
below, will need to be modified to maintain model identifi ability. 

We assume there exists an underlying continuous multivariate Gaussian process, 
{Z(s) = [Z\ (s), . . . ,Z/(s)],s G D}, over the region of interest that is generating Y(s). 
Dropping the dependence on s for ease of notation, we denote Y = [Yi,...,Y/] 
and Z = [Zi , . . . , Zy] where Y ; - and Z ; are the j lh observable response and under- 
lying continuous Gaussian process, respectively. For j = 1 , .... J, we define Fj as 
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the mapping of the continuous variable Zj to the observable response Yj. Whereas 
the observable response data presented in this work are continuous and ordinal, the 
model holds for other types of response variables, e.g. binary, Poisson, etc. The 
mapping function Fj can take on any form as long as it is reasonable to assume that 
an underlying continuous Gaussian process is generating the response. For an ordi- 
nal response, the continuous variable Zj is latent. Here, the mapping Fj is defined as 
a function with parameter vector ki, a (K+ 1) x 1 dimension vector of thresholds, 
that assigns the latent continuous random variables Z ; into the ordered categories 
1, . . . ,K of the observable data Y j (Muthen 1984[ l. The threshold parameter vector 



is constrained such that — °° = A^o < < . . -kj,K = °° for each ordinal metric. We 
define a mapping, Fj, of Z/(s) to Yj(s) as 

K 

I';(s) = F ; -(Z y (s) ) A J -) = £w {X . t _ 1<z . (s) < AM} , j = l,...,J , seD. (1) 

k=l 

For continuous response variables, the mapping Fj is taken as the identity function 
since Zj would be observed directly. 



3.2 Multilevel latency 

We assume that the latent random process is expressed by a mixed model. For the 
j lh random process, Zj, we assume a multivariate Gaussian process where 

Zj ~GP(9jl + cOjK, ajl) . (2) 

We define the mean of each Zj as a metric-specific linear combination of the 1- 
vector and a latent random field H. The latent random field H is the process of 
interest and encompasses the latent measure of wetland condition in our application. 
The fixed effect 9j is the intercept for metric / and the fixed effect C0j is the factor 
loading of the spatial random field H. Both and CO are 1 x J dimensional vectors. 
The parameter CO allows us to quantify the relationship between each of the response 
variables and H. The variance of Zj is specific to each metric j, which we define as 
(7?I where I is the identity matrix. For j ^ I, Zj and Z/ are conditionally independent 
given H, 9, and co. 

The spatial dependence of the multivariate random field is modeled through the 
latent spatial process, H. Note that the inclusion of the additional latent process H 
makes this a multilevel latent process model. We assume this latent spatial process 
is driving the mixed ordinal and continuous multivariate observable response, Y. 
Therefore, H provides a univariate summary measure for each location from which 
we will draw inference across space. We assume H to be a Gaussian process with 
covariates in the mean structure and a covariance matrix defined by a spatial corre- 
lation function. Let 

n~GP(xp,E„(<l>)) (3) 
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where X contains p location-specific observable covariates and j8 is a p X 1 vector 
of coefficients. The covariance matrix £#(0) is described by a function Eu(<j>) = 
p (| |Sj — S/ 1 1 ; (j>) where p is a covariance function with parameters (j) that produces a 
valid covariance matrix depending only on the spatial distance matrix. 




Fig. 1 Diagram of a multilevel latent model with two ordinal observable response variables Yi 
and Y2 and a continuous observable response variable Y3. Here, Zi,Z2, and Z3, represent the first 
level of latency as the latent continuous response variables. H is the second level of latency and is 
the latent spatial random field of interest. There are 4 covariates in the model, Xi,X2,X3, and X4. 
The □ indicates an observable value and the O indicates a random variable. Additional parameters 
are shown next to the links. 



3.3 Bayesian framework 

The observed multivariate data matrix y is of dimension nxJ where n is the number 
of point-referenced locations in our sample and J is thel number of metrics or re- 
sponses at each location. For i=l,...,n and ordinal response variables j = 1, . . . ,J , 
the density of yij is the integral from A ; -. V(7 _i to Xj t y.. of the normal distribution de- 
fined for Zij. Whereas we first defined Zj as a Gaussian process for each j=l,...,J, 
realizations of these processes have a multivariate normal distribution. Denoting the 
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multivariate ordinal observed response y„ = [yi, . . . ,yj a ), we write the likelihood 
of the /* vector of J„, y ,-, as the integral of an n-dimensional multivariate normal 
distribution. Therefore 



Po (y j \H,9 J ,co j ,X j ,a 2 ) = ( mi ■■■( lyni (27r)-"/ 2 |c7/l„r 1 /2 

x exp|-^[Z ; - (flyl + cOjH)}'[ajl„}- 1 [Zj - (0/1 + fi) ; H)]|rfZ ; . 



(4) 



For the multivariate continuous observed response y c = [yi, . . . ,jj c ] , the likelihood 
of the /* vector of y c is the multivariate normal density, p c . Denoting y = [y ,yc]> 
the likelihood for all observations is given by 

Jo Jc 

p{y\H,9,co,X,a 2 ) = Ylp (yj\H.,6j,CDj;Xj,oj) x Y[p c {yj\fl,6 h a)j,aj) 

j=i j=i 

We define prior distributions for all model parameters and latent random variables to 
complete the Bayesian model specification. We aim to assign proper yet vague prior 
distributions to unknown parameters to maintain generality of the model. When 
applicable, conjugate priors are assigned to ease computational complexity. 

To ensure identifiability of the intercept parameter vector 0, it is necessary to 
place a restriction on one of the threshold parameters. Where the lower and upper 
cut points are defined as Xj ,o = — °° and Xjx = °°, we assume without loss of gen- 
erality that A/ i =0 for 7 = 1,... ,J , Therefore, we are left to estimate J x (K — 2) 
threshold parameters. A uniform prior can be assigned to the cut parameters as 
shown in Albert and Chib 
k = 2, . . . ,k— 1 and j =1,...,J 
to poor mixing in the Markov Chain. We transform the parameter A/,i , . . . , A,- j_ i to 



{19931, where p(Xj, k \Xj. k _ u X ktk+ i) °c for 
However, the constraint that A/^_i < A,- * can lead 



a new space with parameters a^i, . . . , (Xj,k-X (Albert and Chib 19971. The trans- 
formation is performed by setting a,.i = A; J = 0, aTj = log(A ; - 2)1 an d letting 
(Xjk — l°g(A/,t — A; k -\) for k — 3,...,K — 1. The inverse transformation is ex- 
pressed as Xj_ k = Lf = 2<?°^'. We then impose an unrestricted multivariate normal 
prior distribution to the (K — 2) X 1 dimension vector a for each j = l,...,J u with 
mean a and covariance matrix A. 

As denoted above, each of the latent response vectors Zj for j = l,...,7is a 
Gaussian process with mean 0,1 + (OjH and covariance matrix ajl. Due to the mul- 
tivariate multilevel latent structure of the model, some parameters will be fixed to 
ensure identifiability of the other parameters of interest. When the threshold vectors 
are metric-specific, as shown in (jl|, the scale parameter, aj, for j — l,...,J of 
the covariance of the continuous multivariate random variables Zj will have to be 
fixed ( Skrond al and Rabe-He sketh 2004). When all of the ordinal metrics have the 
same number of categories, the threshold parameter vector A can be assumed the 
same across all metrics. In this case, the parameter a 2 is identifiable for the ordinal 
metrics if just one element, aj is fixed. Fixing thresholds to be equal for all met- 
rics is not overly restrictive when the number of categories of the ordered response 
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is small. Indeed, it can be helpful when some of the metrics have few responses 
in some categories. Also, the mean and variance of the latent continuous response 
are able to vary across metrics which allows the model to be flexible. However, 
this assumption becomes more restrictive as the number of categories per metric 
increases because the model may not be sufficiently flexible to preserve the pro- 
portions in each category for the different metrics. Without loss of generality, we 
set the variance of the first ordinal response variable, of = 1 and drop the metric 
dependence on the thresholds. The remaining parameters, aj for j = 2, . . . ,J, are 
assigned inverse-Gamma prior distributions with hyper-parameters a, and b z . 

The mean of the distribution of the latent process H is defined as Xj3, where 
the covariate matrix X is centered and scaled and does not include the one vector 
in order to estimate 9 in |2]). The conjugate prior distribution for the p x 1 vector 
j3 is MVN(0,OpI p ). Let Eu((j>) be the covariance of the distribution of H where 
the vector represents the parameters of the covariance function. Here we choose 
an exponential covariance function and write p(s; — S/;0) = (j>iexp~ dil ^ 2 where du 
represents the Euclidean distance between locations i and /. The conjugate inverse- 
Gamma prior distribution is assigned to 0i and a Gamma prior distribution is as- 
signed to 02- The shape and scale hyper-parameters of these distributions are GUj 
and bfa and and b^,, respectively. For identifiability, however, <j>i is set to 1 
when all response variables are ordinal. Specification of the prior distribution of (j>2 
and its corresponding hyper-parameters can be challenging and must be chosen with 
careful consideration to keep it non-informative, (see e.g., Sch midt et al.| |2008 ). 

The parameters 6 and CO are each assigned a multivariate normal prior distribu- 
tion with mean vector and covariance matrix (J 2 Ij. The scale parameters of both 
covariance matrices, a 2 , and (7^, are chosen to be large such that the prior distribu- 
tions are vague. To ensure identifiability of the model parameters one value of the 
1x7 dimension vector co must be fixed. Without loss of generality we set (0\ = 1 . 
Fixing C0\ establishes a point of reference for the relationship between Z and the 
parameter of interest, H. 



3.4 Inference 

We make inference about the parameters of the model using the Bayesian paradigm 
incorporating Gibbs and Metropolis Hastings sampling techniques. This approach 
allows estimation of both the model parameters and the multilevel and multivari- 
ate latent variables, as well as their uncertainty. Due to the constrained threshold 
parameter vector A, the model proposed in this work is computationally complex. 

The joint posterior distribution of the unknown parameters of interest and the 
latent variables given the observed data can be factored and written as 



p(Z,0,a),H,A,a 2 ,j3,0b)c<p(3;|Z,0,(o,H,A,(j 2 ,j3,0)p(Z|0,(o,H,A,2:,j3^) 

x / ,(/f|j3,0) / ,(0,a),A,(j 2 ,j3,0) 
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where p(y\-) is the distribution of the mixed ordinal and continuous multivariate 
random variables given the model parameters and latent variables, p(Z\-) is the 
conditional distribution of the continuous latent random variable, /?(H|j3,0) is the 
distribution of the latent spatial field of interest, and p(9 ,0),X,o , /3 , ) is the joint 
prior distribution for the parameters 9 , (0 , A , <J 2 , /3 , and . 

The Markov chain Monte Carlo (MCMC) algorithm proceeds as follows: 

1. Update the spatial covariance scale and range parameters, 0i and 02, respec- 
tively. 0i can be drawn drawn directly from its complete conditional distribution 
whereas 02 requires a Metropolis-Hastings step to sample from its complete con- 
ditional distribution. 

2. Update the regression parameter vector j3 and the latent spatial multivariate nor- 
mal, H, from their complete conditional distributions. 

3. Update the metric-specific parameters 9 and (0 and variance parameter a 2 each 
in block form from their complete conditional distributions. 

4. Update the threshold parameters, A by drawing a from p(a\y ,Z ) and inverse 



mapping to get A. See Higgs and Hoeting (2010 1 for explicit details on the repa 



rameterization and updating scheme for A. 
5. Update the latent multivariate normal Z„ from the complete conditional distribu- 
tion. 

The samples from the posterior distribution can then be used to draw inference on 
both the model parameters and latent variables. 



4 Posterior inference 
4.1 Posterior prediction 

The model can be used to make predictions for the mixed ordinal and continuous 
multivariate response as well as the underlying latent spatial process at unobserved 
locations. The multivariate response at m unobserved locations will be denoted 
Y= [Yi,...,Yy] where Yj = [Y\j, . . . ,Y„,jY . Similarly, predictions of the latent spa- 
tial process at the m unobserved locations will be written as H = [Hi,... ,H m ]'. Pre- 
dictions can be made using the Bayesian posterior predictive distributions p(Y|y) 
and /?(H|y) for the multivariate response and latent spatial process, respectively. 

In most applications, the value of the latent variable Hi at location i will be incon- 
sequential but the comparison of H across locations may be of interest. For example, 
wetland condition encompasses many variables. If a latent variable Hi summarizes 
wetland condition at site i, comparisons among sites will be useful to many agencies 
and individuals. For each location, we obtain draws from the distributions Hj\y and 
H,\y for each iteration of the Markov chain. We then examine the distribution of the 
posterior ranks for each location to draw inference and conduct comparisons across 
the region of interest. 
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Other model parameters of particular interest include the parameters of the latent 
spatial field H, j3 and <j), as well as the metric-specific parameters of Z,0), and a 2 . 
Estimating the parameter vector of coefficients of the linear model, j5, enables us 
to evaluate the relationship between the point-referenced covariates and the latent 
random variable H. The unaccounted for spatial correlation of the latent random 
variable H can be estimated by drawing inference on 0i and 02 as well as the effec- 
tive range, 3/02- The effective range is the distance at which the correlation function 
does not exceed 0.05 times the variance. 



4.2 Multivariate correlation statistics 

We estimate the relationship between latent variables Z = [Zi,...,Z/] and H by 
computing multiple correlation values. Due to the deterministic relationship be- 
tween latent Z and observed Y, we assume that the relationship we are estimating 
will capture that of the relationship between H and the multivariate response Y. This 
is a method used in canonical correlation analysis to evaluate the level of linear rela- 
tionship between two sets of variables (Rencher, 2002i It is useful to first partition 



the covariance matrix of the matrix Z and vector H as 

Szz Szh 

where Szz is the 7 x 7 sample covariance matrix of Z, Szh is the 7x1 matrix 
of sample covariances between Z and H, and shh is the sample covariance of H. 
The (j,f) element of Szz is the covariance between the n x 1 dimension vectors 
Z,j and Z,v. Similarly, the /* element of Szh is the covariance between the n x 1 
dimension vectors Zj and H. A measure of association between Z and H as a whole 
is R\[ = IS^Sz/zS^S/zzI- This value is analogous to R 2 in linear regression. This 
value can also be expressed in terms of the canonical correlations between Z and H. 
However, we would like to evaluate the correlation between each of the responses 
and H separately. The correlation between Z ; - and H is defined as the square root of 



diag(Szz); (5) 

where (Szn)j is the /" element of the 7x1 vector Szh- 

We evaluate the multiple correlation for each metric using the posterior simula- 
tions. Therefore, at each simulation draw of the model parameters, we first compute 
the covariance matrix S. Then, for j = 1, ... ,7, we compute the correlation between 
the posterior draw of Z ; and H using (|5jl. Larger values of Rzj\h (i- e -> closer to 1) 
suggest that metric j is more correlated with the underlying latent variable H. In ap- 
plication, a large Rzj\h value means that metric j is a good measurement or predictor 
for the unobserved latent spatial process. We use the multiple correlation values to 
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rank the importance of each of the response metrics in measuring the latent spatial 
process of wetland condition. 



4.3 Model evaluation 

Mixed ordinal and continuous multivariate response models present a unique prob- 
lem for model evaluation. Whereas there are multiple methods to measure predictive 
ability for discrete response models or continuous response models, the difficulty 
arises when we wish to compare mixed response models with both continuous and 
discrete variables. Multicategory loss functions like those presented in Higgs and 
Hoeting ( 2010| l cannot be applied when J c ^ 0. Therefore, we direct our attention to 



loss functions for continuous data since we have a continuous latent variable for all 
J metrics. In the Bayesian framework, the loss is computed by comparing the true 
value to draws from the posterior predictive distribution. Therefore, we first need to 
determine the "true" value for the ordinall variable on the continuous scale. The pos- 
terior mean or median of the latent continuous response could be used as the "true" 
value but we feel this favors the discrete response metrics. We propose setting the 
"true" value for the continuous representation of the observed ordinal variable y as 
the value Z such that 



1 2 



4Z-I1, 

exp 2 "' dZ 




- = 0.50 (6) 
dZ 



where jj, z and a£ are the mean and variance of the posterior distribution of Z, re- 
spectively. Therefore, Z is the 50' /! percentile of the estimated normal distribution 
between the thresholds Ay_i and ky. We can estimate both fi z and a} for i— l,...,n 
and j = l,...,J Q using the posterior draws of the parameters (Oj, 0j,Hj and aj. We 
apply this method to perform model comparison in Section [5] under squared error 
loss. 



5 Assessing wetland condition 
5.1 Data and model specification 

The data were collected at 95 locations within the North Platte River Basin and 137 
locations within the Rio Grande River Basin, resulting in n = 232 locations (Figure 
[2j. The surveyed parcel consisted of a 0.5 hectare area around each target location. 
These locations were sampled randomly using a Generalized Random Tessellation 



Stratified (GRTS) survey design (Stevens and Olsen 20041. Details of the GRTS 



12 



Erin M. Schliep and Jennifer A. Hoeting 



design differed between the basins ( |Lemly et al.||20lT]|Lemly and Gillian] |20 12} . 
We applied the multivariate multilevel latent Gaussian process model to each river 
basin separately and to the basins together and reached similar conclusions. The 
results presented here are those from the river basins modeled together as one data 
set. 

The data include measurements to evaluate the biotic integrity of the wetland, 
as well as the surrounding landscape, soil, and water conditions. Here we apply 
our multilevel latent model to evaluate the biotic integrity of wetlands. We refer to 
the biotic integrity as a proxy for wetland condition because it's the biotic condi- 
tion that drives the overall condition of the wetland. Five measurements, or metrics, 
were derived from detailed vegetation surveys conducted at each field location. The 
five metrics include native plant cover, noxious weed cover, aggressive native cover, 
structural complexity, and floristic quality assessment ( |Lemly and Gillian 20121. 
It is assumed that each of these metrics represents a component of the biotic in- 
tegrity of the wetland. It is current practice for wetland condition assessment to use 
a method of weighted averages to evaluate the biotic condition using these metrics. 
Whereas these weights are often thought to be assigned based on best professional 
judgement or without statistical support, our goal is to use the data within the multi- 
variate multilevel latent Gaussian process model to rank the metrics in a hierarchy of 
most important to least important to assess wetland condition. We can then identify 
a subset of the metrics that are most valuable for future data collection. 

Each metric was reported on a five-category ordinal scale from "poor" to "excel- 
lent," to which we assign integer values from 1 to 5, respectively (Appendix |7.1[ ). 
The floristic quality assessment, native plant cover, noxious weed cover, and ag- 



gressive native cover metrics are discretized continuous variables (See|Lemly and 



Gillian (2012 1 for more details on discretization). The floristic quality metric evalu- 



ates the overall floristic quality and fidelity of the plant community at each location 
to natural, or undisturbed, conditions ( |Rocchio| [2007 ). Each species in the Colorado 
flora has been assigned a coefficient of conservatism (C value: 0-10) that reflects 



the species tolerance of intolerance to disturbance (Swink and Wilhem 1994 Taft 



et al. 1997 1. The continuous value is an average of C values assigned to the plants 



present at the wetland site. The ordinal value at each location is assigned by apply- 
ing a threshold to the continuous metric value. However, this thresholding scheme 
is dependent on wetland type because the natural vegetation differs between wet- 
land type with some naturally containing plant species with lower values of floristic 
quality. Structural complexity is Likert-like and has no tangible underlying contin- 
uous variable. Here, we fit a discrete-only model with J = 5 and J c = as well 
as a mixed response model with J u = 4 and J c = 1 where the continuous metric is 
floristic quality and compare the results. For all J ordinal responses, the observed 
value Y, € {1, . . . ,K = 5} for i=l,.. . ,232. 

The variance of Z,i is fixed and held constant across all locations at of = 1 for 
model identifiability. The hyperparameters of the inverse-gamma distributions of 
the metric specific variance parameters o? are a z =b z = \ for j = 2, . . . ,5. The 
metric specific parameters 9 and co are of dimension 1x5. We set the variance 
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Fig. 2 The n — 232 locations of observed data within the North Platte and Rio Grande River 
Basins. 



hyperparameters Og = a% = 100. For identifiability of the coefficient vector /3, we 
fix a>i = 1. 

Elevation and percent of closed tree canopy vegetation are two continuous point- 
referenced covariates used to model the mean of the Gaussian process H (Appendix 
|7.1| >. We also included wetland type as a categorical covariate with five levels: ripar- 
ian shrublands and woodlands, saline wetlands, marshes, wet meadows, and fens. 
The prior distribution of the coefficient vector j3 is MVN(0, CJ^Ip) with ct| = 100 
and p = 6. The exponential covariance function for the latent random variable H is 
defined as 0i exp _rf ' 7< ^ where d$ is the Euclidean distance between locations i and 
In the mixed response model, we assign an Inv.Gamma(l, 1) prior for 0i and fix 
01 = 1 in the discrete-only model. In both models, 02 is assigned a Gamma(2,2) 
prior distribution. The prior of 02 was chosen such that the effective range, 3/02, 
could reach the maximum distance between sites. 



5.2 Model results 

The Markov chain Monte Carlo algorithm was run for 100,000 iterations using R 



software (R Development Core Team 2007 1. The first 10,000 iterations for both 



models were discarded as burn-in. We ran multiple chains from different starting 
values to evaluate convergence of our Gibbs sampler. The Gelman (2004) potential 
scale reduction factor for each parameter was below 1.2. Similarly, other standard 
diagnostics showed no indications of lack of convergence. 

The posterior estimates from both the discrete-only model and the mixed re- 
sponse model indicate that wetland condition scores are higher for locations at 
higher elevations and with higher percentages of closed tree canopy (Table [T] for 
discrete-only response model, Table|2]for mixed response model). The coefficients 
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ft, ft, ft, and fa represent the effect for saline, marsh, wet meadow, and fen wet- 
land types, respectively, relative to riparian shrublands and woodlands. These values 
vary greatly between models due to the discretization of the fioristic quality assess- 
ment metric. The discretization process includes additional information about the 
condition of each site based on its wetland type and thus, the ordinal values for this 
metric are not uniformly assigned across all locations ( |Lemly and Gillian] |2012| ). 
For example, a riparian wetland with a fioristic quality value of 5.6 on the continu- 
ous scale would be assigned a 4 on the ordinal scale, whereas a marsh wetland with 
the same continuous value would be assigned an ordinal value of 5. For this reason, 
the coefficients for marsh and saline wetland type vary between the two models. 

All estimates of the factor loading Q 0) are positive indicating that the linear 
relationship between latent wetland condition and each of the individual metrics is 
positive (Tables [T] and [2]). Based on the 95% credible intervals these estimates are 
all significantly different from 0. 

The estimates of effective range of spatial correlation for the two models are 
comparable at 88 and 67 km. The overall maximum distance between the 232 ob- 
served locations is d max = 470 km whereas the maximum distance within the North 
Platte and Rio Grande River Basins is 93 km and 202 km, respectively. The mini- 
mum distance between sampled locations from the two river basins is 240 km. Not 
surprisingly, the estimate of the effective range indicates that the spatial correlation 
of wetland condition is only of interest within the river basins and not between them. 

Table 1 Posterior estimates and 95% credible interval for discrete-only model parameters. 



Parameter 




Estimate 


95 % CI 


ft 


Elevation 


0.54 


(0.22, 0.89) 


ft 


Closed tree canopy 


0.40 


(0.20, 0.62) 


ft 


Saline 


0.62 


(0.16, 1.10) 


ft 


Marsh 


0.60 


(0.26, 0.97) 


ft 


Wet meadow 


-0.03 


(-0.28, 0.21) 


ft 


Fen 


1.00 


(0.55, 1.53) 


y<t>2 


Effective Range 


0.88 


(0.45, 1.84) 


0)i 


Native plant cover 


1.00 




0) 2 


Noxious weed cover 


1.37 


(1.00, 1.90) 


0) 3 


Aggressive native cover 


2.54 


(0.89, 6.01) 


0) 4 


Structural diversity 


0.21 


(0.11,0.33) 


0)5 


Fioristic quality 


1.59 


(1.33, 1.91) 


°? 


Native plant cover 


1.00 




ai 


Noxious weed cover 


1.34 


(0.86,2.16) 


°l 


Aggressive native cover 


20.46 


(8.00, 67.64) 


°l 


Structural diversity 


0.89 


(0.67, 1.18) 


°1 


Fioristic quality 


0.36 


(0.22, 0.57) 
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Table 2 Posterior estimates and 95% credible interval for mixed response model parameters. 



Parameter 




Estimate 


95 % CI 


Hi 


Elevation 


0.39 


(0.23, 0.57) 


02 


Closed tree canopy 


0.17 


(0.07, 0.28) 


03 


Saline 


-0.21 


(-0.51,0.07) 


04 


Marsh 


-0.22 


(-0.43, -0.03) 


05 


Wet meadow 


-0.26 


(-0.42,-0.12) 


06 


Fen 


0.22 


(0.05, 0.42) 


3/02 


Effective Range 


0.67 


(0.31, 3.08) 


0)i 


Native plant cover 


1.00 




0) 2 


Noxious weed cover 


1.21 


(0.86, 1.69) 


0) 3 


Aggressive native cover 


5.37 


(2.28, 10.57) 


0) 4 


Structural diversity 


0.38 


(0.25, 0.54) 


«5 


Floristic quality 


1.52 


(1.28, 1.83) 


°? 


Native plant cover 


1.00 




a| 


Noxious weed cover 


1.36 


(0.89,2.18) 


°f 


Aggressive native cover 


11.83 


(4.41, 33.67) 




Structural diversity 


0.61 


(0.46, 0.81) 




Floristic quality 


0.18 


(0.13, 0.24) 



To compare the performance of the discrete-only model to the mixed response 
model, we compute the median squared error loss using the latent response Z. For 
the ordinal metrics, we estimate the "true" value of Z using Q. The squared er- 
ror loss for each metric is similar between the discrete-only model and the mixed 
response model (See Table[5]in Appendix 7.2 1. 

The remaining results presented here are for the discrete-only model because it 
of interest to the ecologists. The multiple correlation statistics (|3j suggest that met- 
ric 5, floristic quality assessment, is most closely correlated with wetland condition 
(Table|3]l and should be ranked most important in evaluating wetland condition. The 
assessments of native plant cover, noxious weed cover, and aggressive native cover 
are moderately related to wetland condition. The structural diversity measurement 
(metric 4) appears to be the least correlated with wetland condition of the five mea- 
surements and therefore is ranked last. Estimates of percent contribution are also 
given in Table |3j where the values are calculated based on the estimate of Rzj\h di- 
vided by the sum of all estimates of Rzj\h f° r 7 = 1, ■ ■ ■ ,5. The percent contribution 
estimates can be used as weights for each of the metrics in estimating the underlying 
wetland condition. The last column in Table [3]reports the current index weights that 
were selected by a group of wetland experts ( fLemly and Gillian j |20 1 2) > . The scien- 
tists believe floristic quality assessment to be the most important. The weight "0 or 
20%" assigns 20% weight to the lower of the noxious weed cover and aggressive 
native cover metrics. Our estimates improve on the current weighting scheme by 
being statistically derived weights for each of the metrics with confidence limits. 
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Table 3 Discrete-only model: For each metric, estimates and 95% credible intervals for the multi- 
ple correlation value, estimates of the percent contribution, and rank in evaluating wetland condi- 
tion. 



Metric 


Parameter 


Est. 


95 % CI 


% Contrib. 95 % CI 


Rank 


Index 


Native plant cover 


«z, 


H 


0.80 


(0.68, 0.88) 


0.23 


(0.20, 0.26) 


3 


20% 


Noxious weed cover 


R z 2 


H 


0.84 


(0.70, 0.92) 


0.24 


(0.21, 0.28) 


2 


or 20% 


Aggressive native cover 


«z 3 


H 


0.58 


(0.23, 0.83) 


0.17 


(0.08, 0.22) 


4 


or 20% 


Structural diversity 


R z 4 


H 


0.28 


(0.09, 0.49) 


0.08 


(0.03, 0.13) 


5 


20% 


Floristic quality 


R z 5 


u 


0.96 


(0.92, 0.98) 


0.28 


(0.25, 0.32) 


1 


40% 



We estimate the latent spatial process H of wetland condition within the North 
Platte and Rio Grande River Basins by drawing from the posterior distribution 
p(Hj\y) for i = 1, . . . ,«. Since the values of H hold no intrinsic value, we rank the 
locations from draws of the posterior distribution. For each draw t in l,...,T, the 
posterior value of Hi is ranked across all i — 1 , . . . , n assigning a posterior rank to 
each location for each draw. We estimate the latent spatial process of wetland condi- 
tion by computing the median of the posterior ranks at each location. A location with 
a median posterior rank falling in the top 20% of ranks indicates that the wetland at 
this particular location is in the top 20% of all wetlands in the region in terms of bi- 
otic condition. Figure [3] shows the median of the posterior ranks across all locations 
within the North Platte and Rio Grande River Basins. Linear interpolation is used 
to provide a relatively smooth surface over the two river basins. Note, however, that 
wetlands are not found continuously over the regions. The color scale and contours 
of the surface are based on the percentile of the median of the posterior ranks over 
all locations. Wetland management efforts should be directed towards areas within 
the river basins with low posterior ranks. For example, the wetlands in the eastern 
region of the Rio Grande River Basin may be of concern. Conversely, land managers 
may wish to preserve wetlands in good condition such as those shown in red in Fig- 
ure [3] Similar plots can be made for the estimates of uncertainty. We performed a 
simulation study to evaluate the model and out-of-sample predictive performance 
(Appendix 7.3 I. The results indicate that our method provides accurate parameter 
estimates, predictions, and predictive coverage for the simulation scenarios that we 
considered. 



6 Discussion 

The multilevel multivariate latent Gaussian process model presented in this paper 
provides a method for evaluating a continuous latent Gaussian process using mixed 
ordinal and continuous multivariate response data. A multivariate latent variable is 
used as the continuous representation of the multivariate mixed response. A second 
latent variable depending on site-specific covariates models the continuous random 
field that is assumed to be driving the multivariate response. Whereas the continu- 
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Fig. 3 Median of the posterior ranks of the latent spatial process encompassing wetland condition 
(H) across space from the discrete-only response model. 



ous latent random field was modeled in this work using a Gaussian process, Lind- 



|gren et aL] ( |2011| l present an approximation to the Gaussian field using a Gaussian 
Markov random field. This approach could accelerate estimation of the parameters 
of the spatial covariance function. 

Our multilevel multivariate latent variable model is used to evaluate the ecologi- 
cal condition of wetlands or other natural resources. Whereas Liu et al. (2005 ) gave 
a general framework for spatial structural equation modeling, the model presented 
here for multivariate response data could be easily replicated or modified for other 
applications. The model is advantageous as it allowed for comparisons of the condi- 
tion of wetlands in two river basins in Colorado across space. Further, in-field mea- 
surements, or metrics, were ranked when evaluating the wetland condition score 
at each particular location. These rankings allow assignment of statistically valid 
weights to the five measurements or metrics. These results will lead to a decrease in 
the time and effort needed for future wetland evaluation. They will also help land 
managers to design and implement effective protocols for maintaining and restoring 
wetland habitats. 

While we have described and applied the model to a problem related to wetland 
condition, the model holds in much larger context. For example, in human health, 
doctors apply a panel of tests to a subject to evaluate health. In this case, the mul- 
tivariate response would be the outcomes of the tests and the covariates would be 
individual information such as gender and body mass index (BMI). 
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7 Appendix 

7.1 Observed Data 

The frequency of the observed ordinal response values for each metric over all n = 
232 locations are summarized in Table|4] Figures [4]and[5]show univariate summaries 
between the each ordinal response and the covariates. 



Table 4 Observed response data by metric 



Ordinal response 



Metric 


1 


2 


3 


4 


5 


Native plant cover 


9 


16 


60 


69 


75 


Noxious weed cover 


1 


6 


10 


50 


165 


Aggressive native cover 


1 


4 


4 


3 


220 


Structural diversity 


5 


15 


82 


108 


22 


Floristic quality 


24 


47 


65 


26 


69 




12345 12345 12345 

Native plant cover Noxious weed cover Aggressive native cover 




12345 12345 
Structural diversity Floristic quality 



Fig. 4 Boxplots of elevation (y-axis) for each ordinal response (x-axis) for each metric. 
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Fig. 5 Boxplots of closed tree canopy (y-axis) for each ordinal response (x-axis) for each metric. 



7.2 Squared error loss 

The discrete-only model and the mixed response model are compared by computing 
the median squared error loss using the posterior predictions of the latent response 
Z. The "true" value of Z for the ordinal metrics is estimated using i5l. To compare 
squared error loss across models and metrics, we scale each loss value by the vari- 
ance of its "true" value of Z. The standardized loss for each location i and metric j 
is computed using the posterior draws as 

1 , 2 (7) 

where zf^ is the m th draw of Zy and Zy is the true value of the continuous rep- 
resentation of the observed ordinal response, Yy. For a continuous response metric, 
ai is the variance of the response vector Y , since Z , is observed. For an ordinal 

response metric, d? is the variance of Z,, which is based on the MCMC draws. 
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The resulting loss for each metric is similar between the discrete-only model and 
the mixed response model (Table [5]). 



Table 5 Median squared error loss comparison between the two models. 



Metric 


Discrete-Only Model 
Loss Estimate 


Mixed Response Model 
Loss Estimate 


Native plant cover 


0.73 


0.75 


Noxious weed cover 


0.74 


0.83 


Aggressive native cover 


1.70 


1.04 


Structural diversity 


0.96 


0.93 


Floristic quality 


0.58 


0.37 



7.3 Simulation Study 

To evaluate the performance of our methods we simulated data based on the multi- 
variate multilevel latent variable model with three discrete response metrics. Three 
datasets were simulated as outlined below and the model was estimated for each 
dataset. The outcomes of the three simulation were similar and therefore we report 
the results of only one. We define our spatial domain of interest as a 3 x 3 spatial grid 
and simulated 300 locations uniformly over the region. We use the first n = 200 lo- 
cations to fit the proposed model and the remaining m — 100 locations for prediction 
(Figure [§}. 

We randomly simulated two multivariate random variables over the spatial do- 
main to use as covariates Xi and X2 in the mean of H where H is drawn according 
([3]). Values for the coefficient vector /3 = (ft , ft) were fixed at 0.22 and 0.95, respec- 
tively. The true parameter of the covariance of H was fixed at §\ = 1 and <p2 = 15.76. 
Note that 0i is the sill parameter of the covariance and <p2 is the range parameter of 
the exponential correlation function. The effective range of the exponential correla- 
tion function is 3/<p2 = 0.19. In this simulation, the spatial correlation is only large 
at locations at close proximity. The true latent spatial random variable H was a ran- 
dom drawn from a multivariate normal distribution with mean X/3 and covariance 
£h{§) = 01 exp~ d< ^ 2 where d is the n x n matrix of distances between sample lo- 
cations. The true fixed effects 9 and co were randomly drawn from independent 
uniform and normal distributions, respectively where 9 = {1.73,3.80,3.62} and 
CO = {1.00,1.37,-0.77}. The true values of d\, 62, and 63 were all chosen to be 
positive to ensure that the observed ordinal response values spanned each of the 
K = 5 categories. For j = 1,2,3, the true continuous latent random vector Zj was 
drawn from its multivariate normal distribution with mean 9jl + CQjH and variance 
(J?I„. The length 6 vector of threshold values A was fixed such that Ao = — °°, Ai = 0, 
and A5 = °°. The other thresholds were drawn from a multivariate normal distribu- 
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Fig. 6 Three hundred simulated locations within 3x3 grid. Two hundred locations used to fit the 
model and the remaining one hundred were used for model evaluation. 



tion on the transformed scale and then were back transformed. This was done to 
preserve the constraint that A* < A^+i. The resulting true threshold was set to 

A = {-00,0, 1.81, 3.26, 4.71, oc}. 

The observed ordinal response data Y ; for metrics j = 1,...,3 are in the set 
{1, . . . ,5} based on the values of Zj and the true threshold vector A. 

We ran the MCMC algorithm for 100,000 iterations and discarded the first 10,000 
as burn-in. The true parameter values as well as the posterior median and 95% cred- 
ible intervals are given in Table [6] The results show that all but two parameters, ©2 
and are captured their respective credible interval. 

Using the posterior draws of the model parameters, we make predictions using 
the Bayesian posterior prediction distributions p(Y\y) and p(H|y). We evaluated the 
predictive ability of the model by comparing the mode of the posterior prediction 
distribution to the true metric value at each site for each metric (Table |7J. Of the 
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Table 6 Simulated parameter values and posterior median estimates and 95% credible interval 
from model output. 



Para mptpr 


T*tiip Valnp 


Pstimatf* 

1 , 1 1 I 1 1 1 1 il I \- 


95 % CI 


Hi 


0.22 


0.21 


(0 04 39") 


fa 


0.95 


1.15 


(0.90, 1.41) 


3/02 


0.19 


0.31 


(0.17,0.66) 


(Oi 


1.00 


1.00 




Gh 


1.37 


0.73 


(0.47, 1.12) 


(03 


-0.77 


-0.80 


(-1.01, -0.61) 


01 


1.73 


1.44 


(0.73, 2.13) 


02 


3.80 


4.16 


(3.30, 4.93) 


03 


3.62 


4.01 


(3.28,4.83) 


°? 


1.00 


1.00 






2.75 


0.96 


(0.48, 1.97) 




1.41 


1.52 


(1.03,2.18) 



300 predicted metric scores, the truth was captured 57% of the time, whereas the 
predicted metric value was within 1 of the truth 93% of the time. 



Table 7 The posterior modes and true discrete metric response values at the m = 100 new locations 
for all metrics. In bold are the number of correct predictions of metric response values. 





True Value 


Posterior Median 


1 


2 


3 


4 


5 


1 


1 


1 











2 


10 


15 


8 


3 





3 


5 


29 


36 


25 


3 


4 





2 


4 


15 


10 


5 


1 


2 


5 


22 


103 



Capturing the latent random field H is of primary focus in this work. We make 
predictions of H at m = 100 new locations by taking draws from the Bayesian 
posterior prediction distribution p(H|y). 95% posterior prediction intervals of H at 
m =100 new locations indicate that only one interval fail to capture the true value 
(Figure|7]). This indicates that our method achieves appropriate predictive coverage. 
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Fig. 7 95% posterior prediction intervals for latent spatial field H at m = 100 new locations. The 
true value is captured in 99 of the intervals. 



